1 Evaluating estimators

Suppose you observe data X_1, ..., X_n that are iid observations with distribution F_θ indexed by some parameter θ. When trying to estimate θ, one may be interested in determining the properties of some estimator θ̂ of θ. In particular, the bias

    Bias(θ̂) = E(θ̂ - θ),

that is, the average difference between the estimator and the truth, may be of interest. Estimators with Bias(θ̂) = 0 are called unbiased.

Another (possibly more important) property of an estimator is how close it tends to be to the truth on average. The most common choice for evaluating estimator precision is the mean squared error,

    MSE(θ̂) = E((θ̂ - θ)²).

When comparing a number of estimators, MSE is commonly used as a measure of quality. Directly using the identity var(Y) = E(Y²) - E(Y)², with the random variable Y = θ̂ - θ, the above equation becomes

    MSE(θ̂) = E(θ̂ - θ)² + var(θ̂ - θ)
            = Bias(θ̂)² + var(θ̂),

where the last line follows from the definition of bias and the fact that var(θ̂ - θ) = var(θ̂), since θ is a constant.

For example, if X_1, ..., X_n are iid N(µ, σ²), then X̄ ~ N(µ, σ²/n). So the bias of X̄ as an estimator of µ is

    Bias(X̄) = E(X̄ - µ) = µ - µ = 0

and the MSE is

    MSE(X̄) = 0² + var(X̄) = σ²/n.

The above identity says that the precision of an estimator is a combination of the bias and the variance of that estimator. Therefore it is possible for a biased estimator to be more precise than an unbiased estimator, if it is significantly less variable. This is known as the bias-variance tradeoff. We will see an example of this.

1.2 Using monte carlo to explore properties of estimators

In some cases it can be difficult to explicitly calculate the MSE for an estimator. When this happens, monte carlo can be a useful alternative to a very cumbersome mathematical calculation. The example below is an instance of this.

Example: Suppose X_1, ..., X_n are iid N(θ, θ²) and we are interested in estimation of θ. Two reasonable estimators of θ are the sample mean θ̂_1 = (1/n) Σ_{i=1}^n X_i and the sample
standard deviation θ̂_2 = sqrt( (1/(n-1)) Σ_{i=1}^n (X_i - X̄)² ). To compare these two estimators by monte carlo for a specific n and θ:

1. Generate X_1, ..., X_n ~ N(θ, θ²).
2. Calculate θ̂_1 and θ̂_2.
3. Save (θ̂_1 - θ)² and (θ̂_2 - θ)².
4. Repeat steps 1-3 k times.
5. Then the means of the (θ̂_1 - θ)²'s and the (θ̂_2 - θ)²'s, over the k replicates, are the monte carlo estimates of the MSEs of θ̂_1 and θ̂_2.

This basic approach can be used any time you are comparing estimators by monte carlo. The larger we choose k to be, the more accurate these estimates are. We implement this in R with the following code for θ = .5, .6, .7, ..., 10, n = 50, and k = 1000.

k <- 1000
n <- 50

# Sequence of values of theta
THETA <- seq(.5, 10, by=.1)

# Storage for the MSEs of each estimator
MSE <- matrix(0, length(THETA), 2)

# Loop through the values in THETA
for(j in 1:length(THETA)) {

    # Generate the k datasets of size n
    D <- matrix(rnorm(k*n, mean=THETA[j], sd=THETA[j]), k, n)

    # Calculate theta_hat1 (sample mean) for each data set
    ThetaHat_1 <- apply(D, 1, mean)

    # Calculate theta_hat2 (sample sd) for each data set
    ThetaHat_2 <- apply(D, 1, sd)

    # Save the MSEs
    MSE[j,1] <- mean( (ThetaHat_1 - THETA[j])^2 )
    MSE[j,2] <- mean( (ThetaHat_2 - THETA[j])^2 )
}

# Plot the results on the same axes
plot(THETA, MSE[,1], xlab=quote(theta), ylab="MSE",
     main=expression(paste("MSE for each value of ", theta)),
     type="l", col=2, cex.lab=1.3, cex.main=1.5)
Figure 1: Simulated values for the MSE of θ̂_1 and θ̂_2

lines(THETA, MSE[,2], col=4)

From the plot we can see that θ̂_2, the sample standard deviation, is a uniformly better estimator of θ than θ̂_1, the sample mean. We can verify this simulation mathematically. Clearly the sample mean's MSE is

    MSE(θ̂_1) = θ²/n.

The MSE for the sample standard deviation is somewhat more difficult. It is well known that, in general, the sample variance from a normal population, V, is distributed so that

    (n-1)V/σ² ~ χ²_{n-1},

where σ² is the true variance. In this case θ̂_2 = √V. The χ² distribution with k degrees of freedom has density function

    p(x) = ((1/2)^{k/2} / Γ(k/2)) x^{k/2 - 1} e^{-x/2},

where Γ is the gamma function. Using this we can derive the expected value of √V:
    E(√V) = √(σ²/(n-1)) E(√((n-1)V/σ²))
          = (σ/√(n-1)) ∫_0^∞ √x · ((1/2)^{(n-1)/2} / Γ((n-1)/2)) x^{((n-1)/2) - 1} e^{-x/2} dx,

which follows from the definition of expectation and the expression above for the χ² density. The trick now is to rearrange terms and factor out constants properly so that the integrand becomes another χ² density:

    E(√V) = (σ/√(n-1)) · ((1/2)^{(n-1)/2} / Γ((n-1)/2)) ∫_0^∞ x^{(n/2) - 1} e^{-x/2} dx
          = (σ/√(n-1)) · (Γ(n/2)/Γ((n-1)/2)) · ((1/2)^{(n-1)/2} / (1/2)^{n/2}) ∫_0^∞ ((1/2)^{n/2} / Γ(n/2)) x^{(n/2) - 1} e^{-x/2} dx,

where the integrand in the last line is the χ²_n density. Now we know that the integral in the last line is 1, since it has the form of a χ² density with n degrees of freedom. The rest is just simplifying constants:

    E(√V) = (σ/√(n-1)) · (Γ(n/2)/Γ((n-1)/2)) · ((1/2)^{(n-1)/2} / (1/2)^{n/2})
          = (σ/√(n-1)) · (Γ(n/2)/Γ((n-1)/2)) · √2
          = √(2/(n-1)) · (Γ(n/2)/Γ((n-1)/2)) · σ.

Therefore E(θ̂_2) = √(2/(n-1)) (Γ(n/2)/Γ((n-1)/2)) θ. So the bias is

    Bias(θ̂_2) = θ - E(θ̂_2) = θ (1 - √(2/(n-1)) Γ(n/2)/Γ((n-1)/2)).

To calculate the variance of θ̂_2 we also need E(θ̂_2²). θ̂_2² is the sample variance, which we know is an unbiased estimator of the variance θ², so

    E(θ̂_2²) = θ²,
so the variance of θ̂_2 is

    var(θ̂_2) = E(θ̂_2²) - E(θ̂_2)² = θ² (1 - (2/(n-1)) (Γ(n/2)/Γ((n-1)/2))²).

Finally,

    MSE(θ̂_2) = Bias(θ̂_2)² + var(θ̂_2)
              = θ² [ (1 - √(2/(n-1)) Γ(n/2)/Γ((n-1)/2))² + (1 - (2/(n-1)) (Γ(n/2)/Γ((n-1)/2))²) ].

It is a fact that

    (1 - √(2/(n-1)) Γ(n/2)/Γ((n-1)/2))² + (1 - (2/(n-1)) (Γ(n/2)/Γ((n-1)/2))²) < 1/n

for any n. This implies that MSE(θ̂_2) < MSE(θ̂_1) for any n and any θ. We can check this derivation by plotting the MSEs and comparing with the simulation-based MSEs:

# for each Q, Q[1] is theta and Q[2] is n

# MSE of theta_hat1
MSE1 <- function(Q) (Q[1]^2)/Q[2]

# MSE of theta_hat2
MSE2 <- function(Q) {
    theta <- Q[1]; n <- Q[2]
    G <- gamma(n/2)/gamma( (n-1)/2 )
    bias <- theta * (1 - sqrt(2/(n-1)) * G )
    variance <- (theta^2) * (1 - (2/(n-1)) * G^2 )
    return(bias^2 + variance)
}

# Grid of values for theta, for n=50
THETA <- cbind(matrix( seq(.5, 10, length=100), 100, 1 ), rep(50,100))

# Storage for MSE of thetahat1 (column 1) and thetahat2 (column 2)
MSE <- matrix(0, 100, 2)

# MSE of theta_hat1 for each theta
MSE[,1] <- apply(THETA, 1, MSE1)
Figure 2: True values for the MSE of θ̂_1 and θ̂_2

# MSE of theta_hat2 for each theta
MSE[,2] <- apply(THETA, 1, MSE2)

plot(THETA[,1], MSE[,1], xlab=quote(theta), ylab="MSE",
     main=expression(paste("MSE for each value of ", theta)),
     type="l", col=2, cex.lab=1.3, cex.main=1.5)
lines(THETA[,1], MSE[,2], col=4)

Clearly the conclusion is the same as in the simulated case: θ̂_2 has a lower MSE than θ̂_1 for any value of θ, but it was far less complicated to show this by simulation.

Exercise 1: Consider data X_1, ..., X_n iid N(µ, σ²), where we are interested in estimating σ² and µ is unknown. Two possible estimators are

    θ̂_1 = (1/n) Σ_{i=1}^n (X_i - X̄)²

and the conventional unbiased sample variance,

    θ̂_2 = (1/(n-1)) Σ_{i=1}^n (X_i - X̄)².

Estimate the MSE for each of these estimators when n = 15, for σ² = .5, .6, ..., 3, and evaluate which estimator is closer to the truth on average for each value of σ².
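The gamma-function expression for E(θ̂_2) derived above is easy to sanity-check numerically. The sketch below is an illustration, not part of the original notes; the choices n = 50, θ = 1, and the number of replicates are arbitrary. It compares the closed-form mean of the sample standard deviation with a monte carlo average.

```r
set.seed(1)
n <- 50; theta <- 1; k <- 50000

# Closed-form mean of the sample sd under N(theta, theta^2):
# E(theta_hat2) = sqrt(2/(n-1)) * Gamma(n/2)/Gamma((n-1)/2) * theta
E_closed <- sqrt(2/(n-1)) * gamma(n/2) / gamma((n-1)/2) * theta

# Monte carlo average of the sample sd over k replicates
D <- matrix(rnorm(k*n, mean=theta, sd=theta), k, n)
E_mc <- mean(apply(D, 1, sd))

# The downward bias is small but real: E_closed is slightly below theta
c(E_closed, E_mc)
```

For n = 50 the bias factor is about 0.995, which is why the bias term contributes so little to MSE(θ̂_2) relative to the variance term.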
2 Properties of hypothesis tests

Consider deciding between two competing statistical hypotheses, H_0 (the null hypothesis) and H_1 (the alternative hypothesis), based on data X_1, ..., X_n. A test statistic is a function of the data, T = T(X_1, ..., X_n), such that if T ∈ R_α then you reject H_0, and otherwise you do not. The set R_α is called the rejection region and is chosen so that

    P(reject H_0 | H_0 is true) = P(T ∈ R_α | H_0 is true) = α.

α is referred to as the level of the test, and is the probability of incorrectly rejecting H_0; α is typically chosen by the user, and .05 is a common choice. For example, in a two-sided z-test of H_0 : µ = 0 when σ is known, the rejection region is

    R_α = (-∞, z_{α/2}) ∪ (z_{1-α/2}, ∞),

where z_a denotes the a'th quantile of a standard normal distribution. When α = .05 this yields the familiar rejection region (-∞, -1.96) ∪ (1.96, ∞).

A good hypothesis test is one that, for a small value of α, has a large power, which is the probability of rejecting H_0 when H_0 is indeed false. When testing the hypothesis H_0 : θ = θ_0 for some specific null value θ_0, and θ_true ≠ θ_0, the power is

    Power(θ_true) = P(T ∈ R_α | θ = θ_true).

Some primary determinants of the power of a test are:

- The sample size
- The difference between the null value and the true value (generally referred to as the effect size)
- The variance in the observed data

In many settings practitioners are interested in either a) how far the true value of θ must be from θ_0, or b) for a fixed effect size, how large the sample size must be, for the power to reach some nominal level, say 80%. Inquiries of this type are referred to as power analysis.

Example 2: Power of the two-sample z-test

Suppose you observe X_1, ..., X_n iid N(µ_X, σ²) and Y_1, ..., Y_m iid N(µ_Y, σ²), where µ_X, µ_Y are unknown and σ² is known. We are interested in a two-sided test of the hypothesis H_0 : µ_X - µ_Y = 0.
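Before turning to power, the defining property of the level can itself be checked by monte carlo. The following sketch is illustrative only (the choices n = 30 and k = 20000 are arbitrary, not from the notes): it simulates two-sided one-sample z-tests of H_0 : µ = 0 with σ = 1 known, and confirms that the rejection rate under H_0 is close to α = .05.

```r
set.seed(1)
alpha <- .05; n <- 30; k <- 20000

# Generate k datasets of size n under H0: mu = 0, sigma = 1
X <- matrix(rnorm(k*n), k, n)

# One-sample z-statistic for each dataset: sqrt(n) * Xbar / sigma
T <- sqrt(n) * apply(X, 1, mean)

# Proportion of datasets in which H0 is (incorrectly) rejected
level_hat <- mean(abs(T) > qnorm(1 - alpha/2))
level_hat
```

The estimate fluctuates around .05 by monte carlo error; increasing k shrinks that error, exactly as in the MSE simulations of the previous section.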
A common statistic for testing such hypotheses, when both samples have size n, is the z-statistic:

    T = √n (X̄ - Ȳ) / (√2 σ).

It is well known that, under H_0, T has a standard normal distribution. It can be shown that, for any value of µ_D = µ_X - µ_Y, this test is the most powerful level-α test of H_0. (Similarly, when the variances are unknown and the sample sizes/variances are potentially unequal, Student's t-test is the most powerful level-α test of this null
hypothesis.) µ_D is the measure of effect size in this test, and Power(µ_D) is a monotonically increasing function of |µ_D|. For example, if µ_D is very small, it is intuitive that we would be less likely to reject H_0 than if µ_D were large.

We will investigate the power of the two-sample z-test for sample sizes n = 10, 20, 30, 40, 50, as a function of the true mean difference µ_D. The larger the true σ², the smaller the power will be (for a fixed n and µ_D), but we will not investigate this effect in this example. Each data set will be generated with σ² = 1, the two samples will have equal sizes, and α = .05. The basic algorithm is:

1. Generate datasets of the form X_1, ..., X_n ~ N(0, 1) and Y_1, ..., Y_n ~ N(µ_D, 1).
2. Calculate T.
3. Save I = I(|T| > z_{1-α/2}).
4. Repeat steps 1-3 k times.
5. The mean of the k values of the I's is the monte carlo estimate of Power(µ_D).

# alpha level
alpha <- .05

# number of simulation reps
k <- 1000

# sample sizes
n <- 10*c(1:5)

# the mu_D's
mu_D <- seq(0, 2, by=.1)

# storage for the estimated powers
Power <- matrix(0, length(mu_D), 5)

for(i in 1:5) {
    for(j in 1:length(mu_D)) {

        # Generate k datasets of size n[i]
        X <- matrix( rnorm(n[i]*k), k, n[i])
        Y <- matrix( rnorm(n[i]*k, mean=mu_D[j]), k, n[i])

        # Get sample means for each of the k datasets
        Xmeans <- apply(X, 1, mean)
        Ymeans <- apply(Y, 1, mean)
        # Calculate the z-statistics
        T <- sqrt(n[i])*(Xmeans - Ymeans)/sqrt(2)

        # Indicators of the z-statistics being
        # in the rejection region
        I <- (abs(T) > qnorm(1-(alpha/2)))

        # Save the estimated power
        Power[j,i] <- mean(I)
    }
}

plot(mu_D, Power[,1], xlab=quote(mu(D)),
     ylab=expression( paste("Power(", mu(D), ")")),
     col=2, cex.lab=1.3, cex.main=1.5,
     main=expression(paste("Power(", mu(D), ") vs. ", mu(D))), type="l" )
points(mu_D, Power[,1], col=2); points(mu_D, Power[,2], col=3)
points(mu_D, Power[,3], col=4); points(mu_D, Power[,4], col=5)
points(mu_D, Power[,5], col=6); lines(mu_D, Power[,2], col=3)
lines(mu_D, Power[,3], col=4); lines(mu_D, Power[,4], col=5)
lines(mu_D, Power[,5], col=6); abline(h=alpha)
legend(1.5, .3, c("n = 10", "n = 20", "n = 30", "n = 40", "n = 50"),
       pch=1, col=c(2:6), lty=1)

It is actually straightforward to calculate the power of the two-sample z-test. If µ_D is the true mean difference, then T is a standard normal random variable shifted over by √n µ_D/√2, since E(X̄ - Ȳ) = µ_D. Letting Z denote a standard normal random variable, the power as a function of µ_D is:

    Power(µ_D) = P(|T| > z_{1-α/2})
               = 1 - P(z_{α/2} ≤ T ≤ z_{1-α/2})
               = 1 - P(z_{α/2} ≤ Z + √n µ_D/√2 ≤ z_{1-α/2})
               = 1 - P(z_{α/2} - √n µ_D/√2 ≤ Z ≤ z_{1-α/2} - √n µ_D/√2)
               = 1 - ( P(Z ≤ z_{1-α/2} - √n µ_D/√2) - P(Z ≤ z_{α/2} - √n µ_D/√2) )
               = 1 - ( Φ(z_{1-α/2} - √n µ_D/√2) - Φ(z_{α/2} - √n µ_D/√2) ),

where Φ denotes the standard normal CDF. Notice that, for µ_D > 0,

    lim_{n→∞} Φ(z_{1-α/2} - √n µ_D/√2) = Φ(-∞) = 0
Figure 3: Simulated power of the two-sample z-test for sample sizes n = 10, 20, 30, 40, 50 and µ_D ranging from 0 up to 2

and similarly for Φ(z_{α/2} - √n µ_D/√2); therefore

    lim_{n→∞} Power(µ_D) = 1.

In other words, no matter how small µ_D > 0 is, the power to detect it as significantly different from 0 goes to 1 as the sample size increases. To check this calculation we plot the theoretical power and compare it with the simulation:

# alpha level
alpha <- .05

# sample sizes
n <- 10*c(1:5)

# the mu_D's
mu_D <- seq(0, 2, by=.1)

# storage for the true powers
Power <- matrix(0, length(mu_D), 5)

for(i in 1:5) {
    for(j in 1:length(mu_D)) {
Figure 4: Theoretical power of the two-sample z-test for sample sizes n = 10, 20, 30, 40, 50 and µ_D ranging from 0 up to 2

        Power[j,i] <- 1 - ( pnorm( qnorm(1-alpha/2) - sqrt(n[i])*mu_D[j]/sqrt(2) ) -
                            pnorm( qnorm(alpha/2) - sqrt(n[i])*mu_D[j]/sqrt(2) ) )
    }
}

# plot the results
plot(mu_D, Power[,1], xlab=quote(mu(D)),
     ylab=expression( paste("Power(", mu(D), ")")),
     col=2, cex.lab=1.3, cex.main=1.5,
     main=expression(paste("Power(", mu(D), ") vs. ", mu(D))), type="l" )
points(mu_D, Power[,1], col=2); points(mu_D, Power[,2], col=3)
points(mu_D, Power[,3], col=4); points(mu_D, Power[,4], col=5)
points(mu_D, Power[,5], col=6); lines(mu_D, Power[,2], col=3)
lines(mu_D, Power[,3], col=4); lines(mu_D, Power[,4], col=5)
lines(mu_D, Power[,5], col=6); abline(h=alpha)
legend(1.5, .3, c("n = 10", "n = 20", "n = 30", "n = 40", "n = 50"),
       pch=1, col=c(2:6), lty=1)

We can see that the theoretical calculation matches the simulation. In this case the power calculation is simple, but for most hypothesis tests power calculations are intractable, so simulation-based power analysis is the only option.
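For the t-test analog of this problem, base R's stats package already provides power.t.test, which performs this kind of power analysis in closed(ish) form. A minimal sketch, mapping the setting above (n per group, true difference delta, common sd 1, α = .05) onto its arguments:

```r
# Theoretical power of the two-sided two-sample t-test with
# n = 50 per group, true mean difference 1, and common sd 1
out <- power.t.test(n = 50, delta = 1, sd = 1, sig.level = 0.05,
                    type = "two.sample", alternative = "two.sided")
out$power
```

With n = 50 and delta = 1 the power is already close to 1, consistent with the limiting behavior derived above. Calling power.t.test with power = .8 and n = NULL instead solves for the required sample size.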
Exercise 2: Using a similar approach to the above, consider the same problem, except X_1, ..., X_n ~ N(µ_X, σ²_X) and Y_1, ..., Y_n ~ N(µ_Y, σ²_Y) (both of equal sample size), where σ²_X, σ²_Y are not known but are assumed to be equal. Use the statistic

    T = √n (X̄ - Ȳ) / sqrt(σ̂²_X + σ̂²_Y),

where σ̂²_X and σ̂²_Y are the unbiased sample variances from Exercise 1, calculated for each set of data. Under H_0, T has a t-distribution with 2n - 2 degrees of freedom. Estimate the power of this test for sample sizes n = 10, 20, 30, 40, 50 and for the true µ_X - µ_Y ranging from 0 up to 2. In this case the theoretical power calculation, although possible, is significantly more difficult.
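As a starting point for this exercise (the values n = 20, true difference 1, and the seed are arbitrary illustrations, not part of the notes), the statistic and its rejection rule for a single simulated dataset can be written as below; wrapping this in the same double loop as the z-test code gives the requested power estimates.

```r
set.seed(1)
n <- 20; alpha <- .05

# One simulated dataset with true mean difference mu_X - mu_Y = 1
X <- rnorm(n, mean = 1)
Y <- rnorm(n, mean = 0)

# var() is the unbiased sample variance, as in Exercise 1
T <- sqrt(n) * (mean(X) - mean(Y)) / sqrt(var(X) + var(Y))

# Under H0, T has a t-distribution with 2n - 2 degrees of freedom,
# so compare |T| to the t critical value rather than the normal one
reject <- abs(T) > qt(1 - alpha/2, df = 2*n - 2)
reject
```

The only structural change from the z-test simulation is the per-dataset variance estimate and the qt critical value in place of qnorm.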